library(Rcpp)
library(ggplot2)
library(plotly)
library(data.table)
library(ggpubr)
library(vegan)
library(phyloseq)
# devtools::install_github("schuyler-smith/schuylR")
library(schuylR)
# devtools::install_github("schuyler-smith/ssBLAST")
library(ssBLAST)
setwd('~/google-drive/DARTE/')
sourceCpp('src/fasta_seq_names.cpp')
sourceCpp('src/fasta_merge.cpp')
sourceCpp('src/parse_blast.cpp')
sourceCpp('src/BLAST_db_match.cpp')
sourceCpp('src/process_BLAST.cpp')
metadata <- fread('files/DARTE-QM-v3-metadata-fromKathyMou.csv')
# standardize sample names: replace '.' with '_' to match sample IDs used elsewhere
metadata[[1]] <- gsub('\\.', '_', metadata[[1]])
categories <- fread('files/category_all.tsv', header = FALSE)
colnames(categories) <- c('Source', 'ARG_Class', 'Primer')
This is created by averaging the read counts across all samples from a single source, e.g., the average number of reads aligned from soil at each percent-identity threshold.
This is the same information as the Percent Identity graph, but represented as the proportion of the total number of reads from the raw FASTQ files.
This is the same information as the Percent Identity graph, but represented as the proportion of the total number of aligned reads, i.e., at each increasing percent identity, the number of reads aligned is divided by the number of reads aligned at 60% (the total number of aligned reads).
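The normalization described above can be sketched as follows; the column names and counts here are hypothetical stand-ins for the actual alignment summary table:

```r
library(data.table)

# Hypothetical per-threshold alignment counts
alignments <- data.table(
  Percent_Identity = c(60, 70, 80, 90, 98),
  Reads_Aligned    = c(10000, 9400, 8800, 7600, 6900)
)

# Divide each count by the count at the 60% threshold
# (the total number of aligned reads)
alignments[, Proportion_Aligned :=
  Reads_Aligned / Reads_Aligned[Percent_Identity == 60]]
alignments
```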
A 98% identity match for reads of length >= 100 bp appears to be the best trade-off between the number of reads captured and confidence in the alignment. For subsequent analyses, I used this threshold to determine identified ARGs.
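Applying that threshold to parsed BLAST output amounts to a simple filter. A minimal sketch, assuming tabular output with standard BLAST outfmt-6 column names (`pident`, `length`); the object produced by `parse_blast.cpp` may be structured differently:

```r
library(data.table)

# Hypothetical parsed BLAST hits
blast <- data.table(
  qseqid = c("read1", "read2", "read3"),
  sseqid = c("tetA",  "sul1",  "ermB"),
  pident = c(99.1, 97.5, 100.0),
  length = c(150, 120, 80)
)

# Keep only confident hits: >= 98% identity over >= 100 bp
hits <- blast[pident >= 98 & length >= 100]
hits
```

Only `read1` survives the filter in this toy example: `read2` falls below the identity cutoff, `read3` below the length cutoff.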
Each primer/ARG gene presented is expected to be found based on the mock.expected.fa file. 75% of the primers expected to bind genes in the mock samples were successful and detected in the sequencing; 87% of the ARG genes expected in the mock samples were found.
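The detection-rate calculation reduces to a set intersection. A sketch with hypothetical gene names standing in for the entries of mock.expected.fa and the observed detections:

```r
# Genes expected from mock.expected.fa (names hypothetical)
expected_genes <- c("geneA", "geneB", "geneC", "geneD")
# Genes with at least one confident BLAST hit (names hypothetical)
detected_genes <- c("geneA", "geneB", "geneC")

# Percent of expected genes that were detected
pct_detected <- 100 *
  length(intersect(expected_genes, detected_genes)) /
  length(expected_genes)
pct_detected
```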
Reads were BLASTed against mock.expected.fa and the detections compared back to mock.expected.fa, to see how many of the known ARGs are detected. Something is wrong with Q3_Mock_0_025spike_C (only 3 spike reads, versus 160 and 223 in its replicates).
Same data, but by individual primer presence.
spike_counts_table
## Sample Spike Spike_Level
## 1: Q3_Mock_0spike_A 0 0.0000
## 2: Q3_Mock_0spike_B 0 0.0000
## 3: Q3_Mock_0spike_C 1 0.0000
## 4: Q3_Mock_0_0025spike_A 20 0.0025
## 5: Q3_Mock_0_0025spike_B 11 0.0025
## 6: Q3_Mock_0_0025spike_C 12 0.0025
## 7: Q3_Mock_0_009spike_A 37 0.0090
## 8: Q3_Mock_0_009spike_B 70 0.0090
## 9: Q3_Mock_0_009spike_C 87 0.0090
## 10: Q3_Mock_0_025spike_A 223 0.0250
## 11: Q3_Mock_0_025spike_B 160 0.0250
## 12: Q3_Mock_0_025spike_C 3 0.0250
## 13: Q3_SCI_Manure_0_025spike_A 465 0.0250
## 14: Q3_SCI_Manure_0_025spike_B 598 0.0250
## 15: Q3_SCI_Manure_0_025spike_C 512 0.0250
## 16: Q3_Mock_0_12spike_A 633 0.1200
## 17: Q3_Mock_0_12spike_B 564 0.1200
## 18: Q3_Mock_0_12spike_C 977 0.1200
## 19: Q3_Mock_0_25spike_A 2030 0.2500
## 20: Q3_Mock_0_25spike_B 1686 0.2500
## 21: Q3_Mock_0_25spike_C 1054 0.2500
## Sample Spike Spike_Level
Distances were calculated from raw read counts to compare against how normalizing with relative abundance clusters the replicates. Without housekeeping genes, we cannot directly compare samples to each other in terms of the abundance or expression of particular genes; with relative abundance, however, we should be able to compare the ratios.
Normalizing samples by relative abundance clusters the replicates more tightly, indicating that it works well overall as a normalization method.
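The comparison above can be sketched with vegan's `vegdist` (loaded in the setup chunk). The count matrix here is hypothetical; two replicates with proportional counts are far apart on raw counts but collapse to distance 0 after relative-abundance normalization:

```r
library(vegan)

# Hypothetical raw counts: rep_B is a 10x-deeper sequencing of rep_A
counts <- matrix(c(10,  20,  30,
                   100, 200, 300),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("rep_A", "rep_B"),
                                 c("geneA", "geneB", "geneC")))

# Relative abundance: each row sums to 1
rel_abund <- counts / rowSums(counts)

dist_raw <- vegdist(counts,    method = "bray")  # large: depth dominates
dist_rel <- vegdist(rel_abund, method = "bray")  # 0: identical composition
```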